[FIG. 20B: screen capture titled "Results for Retail1, 16 neural nodes", Baby Prods Kmap (darkest shade = 42% market). One panel per neural node, each listing the node weight, a percentage of volume or value ("% Vol" / "% Val") with an "R" column, and rows labeled mean, meanIO, and partcp.]
SCALABLE PARALLEL ALGORITHM FOR SELF-ORGANIZING MAPS WITH APPLICATIONS TO SPARSE DATA MINING PROBLEMS

TECHNICAL FIELD

This invention relates to a method and apparatus for organizing and retrieving data in a parallel transaction database.

DESCRIPTION OF THE PRIOR ART

The importance of database mining is growing at a rapid pace with the increasing use of computing for various applications. Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data. Catalog companies can also collect sales data from the orders they receive. A record in such data typically consists of the transaction date, the items bought in that transaction, and possibly the customer-id if the transaction is made with a credit card or customer card.

The self-organizing map (SOM) [T. Kohonen, The Self-Organizing Map, Proc. IEEE, vol. 73, pp. 1551-1558, 1985; T. Kohonen, Self-Organizing Maps, Springer, 1995] is a neural network model that is capable of projecting high-dimensional input data onto a low-dimensional (typically two-dimensional) array. This nonlinear projection produces a two-dimensional "feature-map" that can be useful in detecting and analyzing features in the input space. SOM techniques have been successfully applied in a number of disciplines including speech recognition [T. Kohonen, The neural phonetic typewriter, Computer, 21 (3), pp. 11-22, 1988], image classification [S. Lu, Pattern classification using self-organizing feature maps, in IJCNN International Joint Conference on Neural Networks, Newport Beach, Calif., February 1994], and document clustering [K. Lagus, T. Honkela, S. Kaski, and T. Kohonen, Self-organizing maps of document collections: a new approach to interactive exploration, in Proc. Second Intl. Conf. on Knowledge Discovery and Data Mining, [illegible], August 1996]. An extensive bibliography of SOM applications is given in T. Kohonen, The Self-Organizing Map, Proc. IEEE, vol. 73, pp. 1551-1558, 1985 and is also available at T. Kohonen, J. Hynninen, J. Kangas, and J. Laaksonen, SOM-PAK: The self-organizing map program package, Helsinki University of Technology, http://nucleus.hut.fi/nnrc/som_

Neural networks are most often used to develop models that are capable of predicting or classifying an output as a response to a set of inputs to the trained network. Supervised learning is used to train the network against input data with known outputs. In contrast, the SOM typically is applied to data in which specific classes or outcomes are not known a priori, and hence training is done in unsupervised mode. In this case, the SOM can be used to understand the structure of the input data, and in particular, to identify "clusters" of input records that have similar characteristics in the high-dimensional input space. New input records (with the same dimensionality as the training vectors) can be analyzed (and assigned to clusters) using the neural weights computed during training. An important characteristic of the SOM is the capability to produce a structured ordering of the input vectors. This "self-organization" is particularly useful in clustering analysis since it provides additional insight into the relationship between the identified clusters.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is an outline of the serial on-line SOM algorithm.
FIG. 2 is an outline of the serial batch SOM algorithm.
FIG. 3 is an outline of the network-partitioned on-line SOM algorithm.
FIG. 4 is an outline of the data partitioned batch SOM algorithm.
FIG. 5 is an overview of the SP2 as a data mining platform.
FIG. 6 graphically illustrates weight vectors for a model problem.
FIG. 7 graphically illustrates the convergence behavior for SOM and BSOM convergence.
FIG. 8 graphically illustrates the parallel speedups for Retail with 16 neural nodes.
FIG. 9 illustrates the parallel speedups for Retail with 64 neural nodes.
FIG. 10 graphically illustrates the parallel speedups for Retail2 with 16 neural nodes.
FIG. 11 graphically illustrates the parallel speedups for Retail2 with 64 neural nodes.
FIG. 12 graphically illustrates the parallel speedups for Census with 16 neural nodes.
FIG. 13 graphically illustrates the parallel speedups for Census with 16 neural nodes.
FIG. 14 illustrates the population of the segments found by the SOM method applied to Census data.
FIG. 15 graphically illustrates distribution of gender in the Census data set segmentation.
FIG. 16 illustrates the distribution of data by level of education.
FIG. 17 illustrates the distribution of data by level of education.
FIG. 18 illustrates the statistically important inputs for all clusters.
FIG. 19 illustrates the detailed statistics for a single cluster.
FIG. 20 illustrates the statistics for one input field across all clusters.

SUMMARY OF THE INVENTION

It is an object of this invention to improve the response time for data mining for the purpose of identifying clusters of input records which are similar (that is, clusters which have common input parameters) by reducing interprocessor communication in a parallel computer.

It is another object of this invention to reduce the computational time in application to data sets with large numbers of records containing zeros in their input fields.

Accordingly, this invention provides a method and apparatus for organizing data in a parallel transaction database which is partitioned across computational processors of a parallel computer. With this invention each record of data is represented as an n-dimensional vector. These vectors are then compressed by eliminating zeros in components of these vectors. Finally, the compressed input records are processed by a self-organizing map (SOM) algorithm to identify groups of records having common input parameters. The invention also deals with determining statistical measures of each cluster.

DESCRIPTION OF THE PREFERRED EMBODIMENT

We describe a scalable parallel implementation of the self-organizing map (SOM) suitable for data-mining applications involving clustering or segmentation against large
data sets such as those encountered in the analysis of customer spending patterns. The parallel algorithm is based on the batch SOM formulation in which the neural weights are updated at the end of each pass over the training data. The underlying serial algorithm is enhanced to take advantage of the sparseness often encountered in these data sets. Analysis of a model problem shows that the batch SOM algorithm is at least as robust and converges at least as rapidly as the conventional on-line SOM algorithm.

Performance measurements on an SP2 parallel computer are given for two retail data sets and a publicly available set of census data. These results demonstrate essentially linear speedup for the parallel batch SOM algorithm, using both a memory-contained sparse formulation as well as a separate implementation in which the mining data is accessed directly from a parallel file system. We also present visualizations of the census data to illustrate the value of the clustering information obtained via the parallel SOM method.

Data mining is the process of obtaining previously unknown information from very large databases and using it to make effective business decisions [E. Simoudis, Reality check for data mining, IEEE Expert: Intelligent Systems and their Applications, 26-33, October 1996]. In the context of data mining, segmentation or clustering is used to identify groups of records in a database which are mathematically similar based on known attributes (fields) associated with these records. For example, each record may represent a customer account, with attributes such as historical purchase patterns, demographic data, and other account-specific data which characterize the behaviour of this customer. Such data sets can contain several hundred attributes, and analysts are increasingly inclined to use larger numbers of records in the actual model construction in order to avoid concerns about sampling of the input data set. Parallel processing, coupled with high-speed I/O to drive these applications, is an essential component in delivering the application turnaround required by data mining end users.

In this paper, we develop a scalable parallel version of the SOM algorithm suitable for clustering applications that arise in the emerging field of data mining.

The paper is organized as follows. Section 2 describes the serial implementations of the SOM algorithms, including an enhancement to accommodate the sparse structure often encountered in retail data mining applications. Section 3 describes several different approaches to parallelization of these methods, and Section 4 describes the specific implementations on the target SP2 scalable parallel computer. Section 5 summarizes results for a simple model problem as well as two applications in retail data mining and the analysis of some publicly available census data. Section 6 addresses the issue of interpretation of these results, showing some visualizations of results obtained using these methods.

2 Serial SOM Algorithms

There are several variants [T. Kohonen, Things you haven't heard about the self-organizing map, Proc. IEEE Int. Joint Conf. Neural Networks, San Francisco, 1147-1156, 1993] of the SOM, two of which we discuss in this section. As mentioned above, the SOM produces a nonlinear mapping from an n-dimensional input space to a regular two-dimensional lattice of nodes. We assume a set of input vectors $x \in R^n$, and associate a weight vector $w_k \in R^n$, $k = 1, \ldots, K$, with each of K neural nodes arranged in a regular two-dimensional (rectangular) lattice. We introduce a discrete time index t such that $x(t)$, $t = 0, 1, \ldots$, is presented to the network at time t, and $w_k(t)$ is the weight vector computed at time t. The available input vectors are recycled during the training or learning process; a single pass over the input data set is an epoch. Initial values for the weight vectors can be assigned randomly or taken equal to K different input records. Different SOM implementations are defined by the method used to update the weight vectors during training.

2.1 On-Line SOM

In the conventional "on-line" or "flow-through" method, the weight vectors are updated recursively after the presentation of each input vector. As each input vector is presented, the Euclidean distance between the input vector and each weight vector is computed:

$$d_k(t) = \lVert x(t) - w_k(t) \rVert^2 \tag{1}$$

Next, the winning or best-matching node (denoted by subscript c) is determined by

$$d_c(t) = \min_k d_k(t) \tag{2}$$

Note that we suppress the implicit dependence of c on discrete time t. The weight vectors are updated using

$$w_k(t+1) = w_k(t) + \alpha(t)\, h_{ck}(t)\, [x(t) - w_k(t)] \tag{3}$$

where $\alpha(t)$ is the learning-rate factor, and $h_{ck}(t)$ is the neighborhood function. The learning-rate factor controls the overall magnitude of the correction to the weight vectors and is reduced monotonically during the training phase. The neighborhood function controls the extent to which $w_k(t)$ is allowed to adjust in response to an input most closely resembling $w_c(t)$, and is typically a decreasing function of the distance on the 2D lattice between nodes c and k. We use the standard Gaussian neighborhood function

$$h_{ck}(t) = \exp\!\left( -\frac{\lVert r_k - r_c \rVert^2}{\sigma(t)^2} \right) \tag{4}$$

where $r_k$ and $r_c$ denote the coordinates of nodes k and c, respectively, on the two-dimensional lattice. The width $\sigma(t)$ of the neighborhood function decreases during training, from an initial value comparable to the dimension of the lattice to a final value effectively equal to the width of a single cell. It is this procedure which produces the self-organization or topology-preserving capabilities of the SOM: presentation of each input vector adjusts the weight vector of the winning node along with those of its topological neighbors to more closely resemble the input vector. The converged weight vectors approximate the input probability distribution function, and can be viewed as prototypes representing the input data.
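As an illustration of Eqs. (1) through (4), the following is a minimal NumPy sketch of a single on-line update. It is not the implementation of FIG. 1: the function name `online_som_step`, the array layout, and the example values of `alpha` and `sigma` are assumptions made for this sketch.

```python
import numpy as np

def online_som_step(weights, coords, x, alpha, sigma):
    """One on-line SOM update for a single input vector x (hypothetical helper).

    weights: (K, n) array of weight vectors w_k
    coords:  (K, 2) array of lattice coordinates r_k
    """
    # Eq. (1): squared Euclidean distance from x to every weight vector.
    d = np.sum((x - weights) ** 2, axis=1)
    # Eq. (2): the winning node c minimizes the distance.
    c = int(np.argmin(d))
    # Eq. (4): Gaussian neighborhood on the 2D lattice, centered on node c.
    h = np.exp(-np.sum((coords - coords[c]) ** 2, axis=1) / sigma ** 2)
    # Eq. (3): move each weight vector toward x, scaled by alpha * h_ck.
    new_weights = weights + alpha * h[:, None] * (x - weights)
    return new_weights, c

# Usage on a 4 x 4 lattice with 3 input fields (illustrative values only).
rows, cols, n = 4, 4, 3
coords = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
rng = np.random.default_rng(0)
weights = rng.random((rows * cols, n))
x = np.array([1.0, 0.0, 0.5])
new_weights, c = online_som_step(weights, coords, x, alpha=0.5, sigma=1.0)
```

Because the update of every node uses the weights left by the previous record, the result depends on the order in which records are presented, which is the property the batch formulation removes.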
The serial on-line algorithm is summarized in FIG. 1.

2.2 Batch SOM

The SOM updates given by Eq. (3) are "on-line" in the sense that the weight vectors are updated after the presentation of each input record. In the batch SOM algorithm [T. Kohonen, Derivation of a class of training algorithms, IEEE Trans. Neural Networks 1, 229-232, 1990; T. Kohonen, Things you haven't heard about the self-organizing map, Proc. IEEE Int. Joint Conf. Neural Networks, San Francisco, 1147-1156, 1993; F. Mulier and V. Cherkassky, Self-organization as an iterative kernel smoothing process, Neural Computation, 7, 1141-1153, 1995], the weights are updated only at the end of each epoch:

$$w_k(t_f) = \frac{\sum_{t'=t_0}^{t_f} h_{ck}(t')\, x(t')}{\sum_{t'=t_0}^{t_f} h_{ck}(t')} \tag{5}$$

where $t_0$ and $t_f$ denote the start and finish of the present epoch, respectively, and $w_k(t_f)$ are the weight vectors computed at the end of the present epoch. Hence, the summations are accumulated during one complete pass over the input data. The winning node at each presentation is computed using

$$d_k(t) = \lVert x(t) - w_k(t_0) \rVert^2 \tag{6}$$

$$d_c(t) = \min_k d_k(t) \tag{7}$$

where $w_k(t_0)$ are the weight vectors computed at the end of the previous epoch. The neighborhood functions $h_{ck}(t)$ are computed from Eq. (4), but with the winning nodes determined from Eq. (7). This procedure for computing the neighborhood functions is identical to the Voronoi partitioning discussed in [F. Mulier and V. Cherkassky, Self-organization as an iterative kernel smoothing process, Neural Computation, 7, 1141-1153, 1995]. As in the on-line method, the width of the neighborhood function decreases monotonically over the training phase. FIG. 2 illustrates the batch SOM algorithm.

The batch SOM offers several advantages over the conventional on-line SOM method. Since the weight updates are not recursive, there is no dependence upon the order in which the input records are presented. In addition to facilitating the development of data-partitioned parallel methods, this also eliminates concerns [F. Mulier and V. Cherkassky, Learning Rate Schedules for Self-Organizing Maps, Proc. 12th IAPR International Conference on Pattern Recognition, Jerusalem, Volume II, Conf. B, 224-228, 1994] that input records encountered later in the training sequence may overly influence the final results. The learning-rate coefficient $\alpha(t)$ does not appear in the batch SOM algorithm, thus eliminating a potential source of poor convergence [M. Ceccarelli, A. Petrosino, and R. Vaccaro, Competitive neural networks on message-passing parallel computers, Concurrency: Practice and Experience, 5(6), 449-470, 1993] if this coefficient is not properly specified.

2.3 Sparse batch SOM

As mentioned above, we are applying the SOM methodology to data mining problems involving a potentially large number of attributes such as spending in pre-determined categories. Such data sets often contain a large fraction of zero entries because most records do not contain spending in a large fraction of the categories. Fields containing categorical variables (e.g. occupation) also generate sparse input records after binary expansion of these fields. The weight vectors, in general, will not be sparse since non-zero weight components can occur for any field with at least one non-zero entry in the data set.

The batch SOM algorithm can be modified in a straightforward manner so that only operations against non-zero input fields are performed. Eq. (6) is written in the form

$$d_k(t) = \sum_{x_i \neq 0} x_i(t)\, [x_i(t) - 2\, w_{ki}(t_0)] + \sum_{i=1}^{n} w_{ki}^2(t_0) \tag{8}$$

where the first summation is over the non-zero components of the input vector $x(t)$. The second summation in Eq. (8) is independent of t, and hence is computed and stored at the beginning of each epoch. The numerator of Eq. (5) can also be reduced to a computation involving only the non-zero input fields. Use of this formulation reduces the computation from $O(K \cdot n)$ to $O(K \cdot n \cdot f_{nonzero})$, where $f_{nonzero}$ is the fraction of non-zero fields in the input data. Note that the computation of the winning cell in Eq. (7) remains the same, but this computation is $O(K)$, not $O(K \cdot n)$.

The overall sparse batch SOM algorithm is similar to that shown in FIG. 2, with Eq. (6) replaced by Eq. (8) and Eq. (5) replaced by its sparse analog. Analogous ideas [R. Natarajan, Exploratory data analysis in large sparse datasets, IBM Research Report RC 20749, IBM Research, Yorktown Heights, New York, 1997] have also been developed for the conventional on-line SOM algorithm.

3 Parallel SOM algorithms

In general, parallel implementations of neural-network training are developed by either partitioning the network among the processors (network partitioning) or by partitioning the input data across processors (data partitioning). In network partitioning, each processor or parallel task must process every training record using the part of the neural network assigned to it. In data partitioning, each processor trains a full copy of the network using only the input records assigned to it.

3.1 Network partitioning

A number of authors [K. Obermayer, H. Ritter, and K. Schulten, Large-scale simulations of self-organizing neural networks on parallel computers: applications to biological modeling, Parallel Computing, 14:381-404, 1990; C. H. Wu, R. E. Hodges, and C. J. Wang, Parallelizing the self-organizing feature map on multiprocessor systems, Parallel Computing, 17(6-7):821-832, September, 1991; M. Ceccarelli, A. Petrosino, and R. Vaccaro, Competitive neural networks on message-passing parallel computers, Concurrency: Practice and Experience, 5(6), 449-470, 1993; C. V. Buhusi, Parallel implementation of self-organizing neural networks, in V. Felea and G. Ciobanu, editors, Proceedings of 9th Romanian Symposium on Computer Science '93, 51-58, November, 1993] have implemented network-partitioned parallel methods for the SOM algorithm. The major advantage of this approach is that it preserves the recursive weight update shown in Eq. (3), and hence produces exact agreement (within round-off error) with the serial algorithm. FIG. 3 shows the network-partitioned implementation of the basic on-line algorithm from FIG. 1. This algorithm is written in the usual Single Program Multiple Data (SPMD) programming model [W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994], with each task of the parallel application executing the algorithm shown in FIG. 3. We use calls from the industry-standard Message Passing Interface (MPI) [W. Gropp, E. Lusk, and A. Skjellum, Using
MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994], to indicate interprocessor communication. Note that the loops over nodes are partitioned across parallel (MPI) tasks, and that communication (using the MPI_Allgather collective communication routine [W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994]) is required at every iteration in order to determine the winning node. Once all tasks know the winning node, each can update the weight vectors associated with neural nodes in its partition. This communication limits parallel scalability because it introduces a latency-dominated constant overhead at the processing of each input record. Results shown in Section 5.4 confirm the limited scalability of this method, even when applied to problems with relatively large numbers of input fields and neural nodes.

3.2 Data partitioning

Data-partitioned algorithms [R. Mann and S. Haykin, A parallel implementation of Kohonen feature maps on the Warp systolic computer, Proc. Int. Joint Conf. Neural Networks, Vol. II, 84-87, Washington D.C., January, 1990; M. Ceccarelli, A. Petrosino, and R. Vaccaro, Competitive neural networks on message-passing parallel computers, Concurrency: Practice and Experience, 5(6), 449-470, 1993; G. Myklebust and J. G. Solheim, Parallel self-organizing maps for applications, Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, December, 1995; and P. Ienne, P. Thiran, and N. Vassilas, Modified self-organizing feature map algorithms for efficient digital hardware implementation, IEEE Transactions on Neural Networks, Vol. 8, No. 2, 315-330, 1997] offer the potential for much greater scalability since the parallel granularity is determined by the volume of data, which is potentially very large. However, application to the on-line SOM algorithm requires that we relax the strict requirement that the weights be updated at every iteration as in Eq. (3). For example, if the weights are updated only at the end of each epoch, the delayed-update form of the on-line algorithm takes the form [P. Ienne, P. Thiran, and N. Vassilas, Modified self-organizing feature map algorithms for efficient digital hardware implementation, IEEE Transactions on Neural Networks, Vol. 8, No. 2, 315-330, 1997]

$$w_k(t_f) = w_k(t_0) + \alpha(t_0) \sum_{t'=t_0}^{t_f} h_{ck}(t')\, [x(t') - w_k(t_0)] \tag{9}$$

where, as before, $t_0$ and $t_f$ denote the start and finish of the present epoch, and $h_{ck}(t)$ is defined as in Eq. (5). Two potential disadvantages of this approach are (1) the parallel version now differs from the serial result (and, in general, is dependent on the interval between weight updates), and (2) the stability of the method (like its serial counterpart) depends on the choice of $\alpha(t)$ [see P. Ienne, P. Thiran, and N. Vassilas, Modified self-organizing feature map algorithms for efficient digital hardware implementation, IEEE Transactions on Neural Networks, Vol. 8, No. 2, 315-330, 1997, and R. Mann and S. Haykin, A parallel implementation of Kohonen feature maps on the Warp systolic computer, Proc. Int. Joint Conf. Neural Networks, Vol. II, 84-87, Washington, D.C., January, 1990].

The batch SOM method, on the other hand, does not suffer from either of these drawbacks: the serial implementation updates the weights only at the end of each epoch and the learning-rate coefficient $\alpha(t)$ does not appear. In order to facilitate comparison with Eq. (9), we rewrite Eq. (5) in a similar form:

$$w_k(t_f) = w_k(t_0) + \frac{\sum_{t'=t_0}^{t_f} h_{ck}(t')\, [x(t') - w_k(t_0)]}{\sum_{t'=t_0}^{t_f} h_{ck}(t')} \tag{10}$$

Comparison of Eqs. (9) and (10) demonstrates that the batch SOM update rule can be obtained from existing network-partitioned algorithms by a specific choice of the learning-rate coefficient:

$$\alpha(t_0) = \frac{1}{\sum_{t'=t_0}^{t_f} h_{ck}(t')} \tag{11}$$

Hence, we have shown that the batch SOM method provides an alternative means of specifying the learning-rate coefficient in previous data-partitioned learning rules. Note that if the neighborhood function $h_{ck}$ is specified as a Kronecker delta function, i.e.

$$h_{ck}(t) = \begin{cases} 1 & c = k \\ 0 & \text{otherwise,} \end{cases}$$

then the effect of Eq. (11) is equivalent to setting $\alpha$ equal to one over the number of times node k has won during the most recent epoch.

Our data-partitioned parallel method is based on the batch SOM update method given by Eq. (5). The parallel implementation is shown in FIG. 4, where the sparse analogs discussed in Section 2.3 are used. Note that the input records have been evenly distributed across parallel tasks. Each MPI task processes only the input records assigned to it, and accumulates its contributions to the numerator and denominator in Eq. (5). After each task has completed its local accumulation, MPI collective communication (MPI_Allreduce) is used to combine the local sums and place the results in all tasks; each task then performs an identical computation of the new weights using Eq. (5). Note that this algorithm has much coarser granularity than the network-partitioned algorithm in FIG. 3 since interprocessor communication occurs only after $N_{records}/N_{tasks}$ records instead of after every record. Unlike previous data-partitioned parallel implementations, this approach does not involve specification of either a learning-rate coefficient or the frequency with which the weights are updated.

4. Implementation on the SP2 Scalable Parallel Computer

This section reviews the essential architectural features of the target parallel machine, followed by a description of the specific implementations of the algorithms described in the preceding section.

4.1 Overview of the SP2

The IBM RS/6000 SP System is a scalable distributed-memory multiprocessor consisting of up to 512 processing nodes connected by a high-speed switch. Each processing node is a specially packaged RS/6000 workstation CPU with local memory, local disk(s) and an interface to the high-performance switch. The SP2 Parallel Environment supports the Message Passing Interface (MPI) [W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994]
for the development of message passing applications. The
SP2 Parallel I/O File System (PIOFS) permits user-specified
striping of data across multiple PIOFS server nodes. A
parallel relational database system (DB2 Parallel Edition or
DB2PE) is also available on the SP2.
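Since MPI is the communication layer on this machine, it may help to see why the data-partitioned batch update of Section 3.2 needs only one global sum per epoch. The sketch below simulates the parallel tasks in a single process with NumPy and makes no MPI calls; the `+=` accumulation merely stands in for MPI_Allreduce. The function names, the lattice size, and the random test data are assumptions made for illustration and are not taken from FIG. 4.

```python
import numpy as np

def sparse_distance(idx, vals, weights, w_sq):
    """Eq. (8): squared distances using only the non-zero fields of one record."""
    return np.sum(vals * (vals - 2.0 * weights[:, idx]), axis=1) + w_sq

def local_sums(records, weights, coords, sigma):
    """One task's contribution to the numerator and denominator of Eq. (5)."""
    num = np.zeros_like(weights)
    den = np.zeros(weights.shape[0])
    w_sq = np.sum(weights ** 2, axis=1)   # second sum in Eq. (8), once per epoch
    for x in records:
        idx = np.nonzero(x)[0]            # compressed record: non-zero fields only
        vals = x[idx]
        c = np.argmin(sparse_distance(idx, vals, weights, w_sq))             # Eq. (7)
        h = np.exp(-np.sum((coords - coords[c]) ** 2, axis=1) / sigma ** 2)  # Eq. (4)
        num[:, idx] += h[:, None] * vals  # zero fields add nothing to the numerator
        den += h
    return num, den

def batch_epoch(records, weights, coords, sigma, n_tasks=1):
    """One batch SOM epoch with the records split across simulated tasks."""
    num = np.zeros_like(weights)
    den = np.zeros(weights.shape[0])
    for part in np.array_split(records, n_tasks):
        local_num, local_den = local_sums(part, weights, coords, sigma)
        num += local_num                  # stands in for a sum-type MPI_Allreduce
        den += local_den
    return num / den[:, None]             # Eq. (5)

# Usage: a 3 x 3 lattice, 6 input fields, roughly 70% zero entries.
rng = np.random.default_rng(0)
coords = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
records = rng.random((40, 6)) * (rng.random((40, 6)) < 0.3)
weights = rng.random((9, 6))
serial = batch_epoch(records, weights, coords, sigma=1.0, n_tasks=1)
parallel = batch_epoch(records, weights, coords, sigma=1.0, n_tasks=4)
```

In this sketch the one-task and four-task results agree to within round-off, because every task uses the same weights from the previous epoch and the partial sums are simply added.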

The results in this paper were obtained on a 16-processor
SP2 system with 66 MHz "thin" processing nodes, each
having a 64 KB data cache, 128 MB memory, and 2 GB local
disk. Each CPU can perform 4 floating-point operations per
cycle, giving a peak performance rating of 264 MFLOP/s
per processing node. Measured at the MPI (application) level,
the high-speed switch on this specific machine provides up
to 48 MB/s of point-to-point bandwidth, with a message-
passing latency of about 40 microseconds. Both the proces-
sor speed and the sustained interprocessor communication
bandwidth are faster for more recent SP2 systems [IBM
RS/6000 SP System, http://www.rs6000.ibm.com/
hardware/largescale], and we would expect the performance
reported here to increase accordingly.
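The figures quoted above can be checked with a few lines of arithmetic. The sketch below reproduces the 264 MFLOP/s peak rating and adds a simple latency-plus-bandwidth estimate of message time; the linear cost model, the helper name `message_time`, and the reading of 48 MB/s as 48 x 10^6 bytes per second are assumptions of this sketch, not statements from the measurements.

```python
clock_hz = 66e6                  # 66 MHz processing node
flops_per_cycle = 4              # floating-point operations per cycle
peak_mflops = clock_hz * flops_per_cycle / 1e6   # peak rating per node

latency_s = 40e-6                # message-passing latency, about 40 microseconds
bandwidth_bytes_per_s = 48e6     # 48 MB/s, assuming 1 MB = 1e6 bytes

def message_time(n_bytes):
    """Estimated transfer time under an assumed latency + size/bandwidth model."""
    return latency_s + n_bytes / bandwidth_bytes_per_s

# Message size at which latency and transfer time contribute equally.
half_size_bytes = latency_s * bandwidth_bytes_per_s

print(peak_mflops)
print(message_time(8), message_time(half_size_bytes))
```

Under this model, messages much smaller than about 2 KB cost essentially the fixed latency, which is consistent with the latency-dominated overhead attributed to the per-record communication of the network-partitioned method.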

FIG. 5 shows a high-level view of the SP2 as a data
mining platform. Application server nodes communicate
with each other using MPI via "user-space" communication
across the switch. These same application servers also
communicate over the switch with parallel file servers and
parallel database servers using a high-speed IP protocol that
is slower than the user-space protocol but still much faster
than IP across a typical local area network.

Parallel I/O can be obtained from an input data file that has
been partitioned onto the processors' local disks, or from the
Parallel I/O File System, or from DB2 Parallel Edition. (A
parallel extract from DB2PE to PIOFS for performance
reasons is also possible.) The first approach is simplest but
least flexible. There must be exactly as many parallel tasks
as there are partitions, and the mapping of tasks onto
processors is preordained by the partitioning: if part 1 of the
data file is on the local disk of processor 1, then task 1 must
always run on processor 1.

With the PIOFS parallel file system, the input is truly a
single file rather than a multiplicity. PIOFS supports
user-specified striping of data across multiple PIOFS server
nodes, accessed by PIOFS clients that are called by the MPI
tasks. The clients can open different views of the striped
data so that, for example, it is easy to switch from reading
records in round-robin fashion to reading them in large
contiguous chunks without having to rewrite the data. A
single client can access multiple servers, a single server can
serve multiple clients, or multiple clients can access multiple
servers. The number of servers and clients can be equal or
not, and they can be co-resident on the same processor or
not. A 16-task parallel program ... PIOFS can be used as a
regular Unix file system.
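
The two access patterns mentioned above can be illustrated with a small sketch. This is not PIOFS code: the function names are hypothetical, and the sketch only shows which record indices a given task would see under a round-robin view and under a contiguous-chunk view of the same file.

```python
def round_robin_view(n_records, n_tasks, task):
    """Record indices seen by `task` when records are dealt out one at a time."""
    return list(range(task, n_records, n_tasks))


def contiguous_view(n_records, n_tasks, task):
    """Record indices seen by `task` when the file is cut into contiguous chunks."""
    base, extra = divmod(n_records, n_tasks)
    # The first `extra` tasks each receive one additional record.
    start = task * base + min(task, extra)
    stop = start + base + (1 if task < extra else 0)
    return list(range(start, stop))


if __name__ == "__main__":
    for t in range(4):
        print(t, round_robin_view(10, 4, t), contiguous_view(10, 4, t))
```

In both views every record is read by exactly one task; only the assignment of records to tasks changes, which is why no rewrite of the underlying data is needed.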

DB2PE was not used in the applications discussed in this 
paper. 

4.2 Specific Implementations 

The 16-node SP2 has PIOFS clients and servers installed
on all nodes. In all applications, the input training data was
striped in equal-size blocks across all 16 PIOFS servers, and
the same physical nodes were also used to execute the
parallel applications. Our data mining runs use $N_{tasks}$ = 1, 2,
4, 8, or 16 parallel application tasks. Each application task
opens the training-data file such that all tasks can simultaneously
read blocks of data from different PIOFS servers.
For example, for $N_{tasks}$ = 2, the first application task will read
data from the first 8 PIOFS servers, while the second parallel
task will read data from the last 8 PIOFS servers.
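
As a rough illustration of this mapping, the sketch below assigns each task a contiguous group of servers, generalizing the two-task example to the other task counts. The function name, the 0-based numbering, and the assumption that the groups are contiguous for every task count are ours, not taken from the implementation.

```python
N_SERVERS = 16  # PIOFS servers holding the striped training file


def servers_for_task(task, n_tasks, n_servers=N_SERVERS):
    """Return the 0-based server indices read by `task` out of `n_tasks`."""
    if n_servers % n_tasks != 0:
        raise ValueError("n_tasks must divide n_servers evenly")
    per_task = n_servers // n_tasks
    return list(range(task * per_task, (task + 1) * per_task))


if __name__ == "__main__":
    for n_tasks in (1, 2, 4, 8, 16):
        print(n_tasks, [servers_for_task(t, n_tasks) for t in range(n_tasks)])
```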

We report results for three different parallel implementations:

Network-partitioned SOM: This is the basic parallel
on-line SOM algorithm shown in FIG. 3. The complete
training data file is read from PIOFS during every epoch.
Each task reads a distinct block of data, and hence the read
operation proceeds in parallel. However, the network-partitioned
algorithm requires that each task "see" every
input record, so a collective communication operation
(MPI_Allgather) is performed so that each task has a copy
of the data blocks read by all of the tasks. These records are
then processed as shown in FIG. 3.

Data-partitioned BSOM: This is the data-partitioned
batch SOM algorithm shown in FIG. 4. In this
implementation, each task reads its training data from
PIOFS during every epoch, and processes them as shown in
FIG. 4 using the standard non-sparse batch SOM formulation.

Data-partitioned sparse BSOM: This is the data-partitioned
batch SOM algorithm shown in FIG. 4, using the
sparse formulation described in Section 2.3. In this
implementation, each task reads its training data at the first
epoch, compresses out the zero entries, and then stores the
compressed data in memory. The compressed data for an
input vector consists of the non-zero entries, plus pointers to
their original locations in the input vector. All subsequent
accesses to training data are directly to this compressed data
structure in memory.
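
A minimal sketch of such a compressed record, and of a distance computation that visits only its non-zero entries, is given below. The function names are hypothetical. The sketch relies only on the algebraic identity |x - w|^2 = sum over non-zero x_i of x_i (x_i - 2 w_i), plus the sum over all i of w_i^2, and assumes the squared norm of each weight vector is precomputed while the weights are held fixed.

```python
def compress(x):
    """Keep the non-zero entries of x together with their original positions."""
    idx = [i for i, v in enumerate(x) if v != 0]
    val = [x[i] for i in idx]
    return idx, val


def sparse_sq_distance(idx, val, w, w_sq_norm):
    """Squared distance between a compressed record and weight vector w.

    Only the non-zero entries of the record are visited; w_sq_norm is the
    precomputed sum of squares of w.
    """
    return sum(v * (v - 2.0 * w[i]) for i, v in zip(idx, val)) + w_sq_norm


def dense_sq_distance(x, w):
    """Reference computation over all fields, zero or not."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


if __name__ == "__main__":
    x = [0.0, 3.0, 0.0, 0.0, 1.5]
    w = [0.2, 1.0, 0.4, 0.1, 2.0]
    idx, val = compress(x)
    w_sq = sum(c * c for c in w)
    print(sparse_sq_distance(idx, val, w, w_sq), dense_sq_distance(x, w))
```

The work per record is proportional to the number of non-zero entries rather than to the number of fields, at the price of indirect addressing through `idx`.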
5 Numerical Results 

All simulations reported in this section used standard
exponentially decreasing functions for the learning-rate
coefficient $\alpha$ and the width $\sigma$ of the neighborhood function:

$$\alpha(n_e) = \alpha_0 \left( \frac{\alpha_1}{\alpha_0} \right)^{n_e / N_e}$$

where $n_e$ is the current epoch, $N_e$ is the total number
of epochs, and

$\alpha_0 = 0.1$
$\alpha_1 = 0.005$

$\sigma_1 = 0.2$.

Note that $\alpha(n_e)$ and $\sigma(n_e)$ are held constant over epoch $n_e$.

Unless otherwise stated, all simulations used 25 epochs.
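
An exponentially decreasing schedule of this kind can be sketched as follows. The exact functional form is an assumption on our part (interpolation from a start value to an end value, with the exponent given by the fraction of epochs completed), and the function name is hypothetical. The values 0.1 and 0.005 are used only as illustrative inputs.

```python
def exp_decay(v0, v1, epoch, n_epochs):
    """Value that decays exponentially from v0 (epoch 0) to v1 (epoch n_epochs)."""
    return v0 * (v1 / v0) ** (epoch / n_epochs)


if __name__ == "__main__":
    n_epochs = 25
    for epoch in (0, 5, 10, 15, 20, 25):
        # The value is computed once per epoch and held constant within it.
        print(epoch, exp_decay(0.1, 0.005, epoch, n_epochs))
```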
5.1 Model Problem Analysis 

We begin with analysis of a simple synthetic problem in
order to compare results obtained with the on-line SOM and
the batch SOM formulations. Consider a unit square containing
a uniformly-spaced 16x16 grid of 256 input vectors
$x = (x_1, x_2)$ with $x_i = 1/32, 3/32, \ldots, 31/32$ for $i = 1, 2$ (see FIG.
6). We use this data to train a 4x4 two-dimensional SOM. In
the absence of boundary effects, we expect the weight
vectors $\omega_k = (\omega_{k,1}, \omega_{k,2})$ to converge to the geometric centers
of a 4x4 "super-mesh" imposed on top of the 16x16 input
mesh, i.e., $\omega_{k,i} = 1/8, 3/8, 5/8, 7/8$ for $i = 1, 2$. We compute the average
quantization error as the average Euclidean distance between
each input record and its closest weight vector:

$$E_q = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - \omega_{c_i} \right\|^2$$

where $N$ is the number of input records and $c_i$ is the index of
the weight vector closest to $x_i$. For the problem just described,
it can be shown that the converged weight vectors should
produce an average quantization error $E_q = 5/512$.

FIG. 7 shows the average quantization error at each epoch
for runs with 10, 20, and 40 epochs using the conventional
on-line SOM and the batch SOM (BSOM). Note that in
each case, the BSOM converges noticeably faster
than the on-line SOM. The weight vectors after 40 epochs
are shown in FIG. 6; the BSOM vectors show better convergence
to the expected positions at the cluster centroids.
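
The expected quantization error for the model problem can be checked directly. The sketch below places the 16 weight vectors at the super-mesh centers and evaluates the error over the 256 inputs using exact rational arithmetic. Under the assumption that the error is the mean of the squared distances, it reproduces the value 5/512; the mean of the unsquared distances is printed as well for comparison.

```python
from fractions import Fraction
from math import sqrt

# 16x16 grid of inputs: coordinates 1/32, 3/32, ..., 31/32 in each dimension.
coords = [Fraction(2 * j + 1, 32) for j in range(16)]
inputs = [(a, b) for a in coords for b in coords]

# Expected converged weights: centers of the 4x4 super-mesh.
centers = [Fraction(2 * j + 1, 8) for j in range(4)]
weights = [(a, b) for a in centers for b in centers]


def sq_dist(x, w):
    return (x[0] - w[0]) ** 2 + (x[1] - w[1]) ** 2


# Squared distance from every input to its closest weight vector.
closest_sq = [min(sq_dist(x, w) for w in weights) for x in inputs]

mean_sq = sum(closest_sq) / len(inputs)                      # exact rational
mean_dist = sum(sqrt(d) for d in closest_sq) / len(inputs)   # float

print(mean_sq)    # 5/512
print(mean_dist)
```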

5.2 Description of the Application Problems 

We consider three realistic applications of the parallel
SOM methods. The first two data sets, denoted Retail1 and
Retail2, are proprietary spending data: the columns in each
record represent (continuous) spending in that attribute. The
third data set, Census, is publicly available data from the
"adult" database from the Machine Learning Repository at
the University of California, Irvine [University of California
at Irvine, Machine learning databases,
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/]. This problem consists of
6 continuous and 8 nominal (or categorical) attributes for a
total of 14 input fields. (As discussed at [University of
California at Irvine, Machine learning databases,
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/], this is
actually a classification problem, with one additional output
or classification field; we have excluded the classification
field, along with the field "fnlwgt", from our SOM analysis,
leaving 13 active fields.)

The nominal fields in the Census data (e.g., education) were
expanded using binary mapping: given C possible values
(e.g., masters, doctorate, . . . ), this field is expanded to C new
columns, all with an entry of 0 except for a 1 in the column
corresponding to the matching field value. Table 1 summarizes
the characteristics of the 3 data sets. Note that the
Census data expands from the 13 original attributes to 103
fields after the binary mapping. Retail1 contains 99984
records and 272 fields, but only [0.0287 × 99984 × 272] of the
input fields have nonzero spending. The fraction of nonzero
entries for the Census data set is based on the data set after
binary expansion.
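
The binary mapping can be sketched in a few lines. The function name and the short list of education values below are hypothetical and serve only to illustrate the expansion of one nominal field into C columns.

```python
def binary_expand(value, categories):
    """Map one nominal value to len(categories) columns: a single 1, rest 0."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical subset of the values of the 'education' field.
EDUCATION = ["HS-grad", "Bachelors", "Masters", "Doctorate"]

if __name__ == "__main__":
    print(binary_expand("Masters", EDUCATION))
```

Because each expanded field contributes exactly one non-zero column, the expanded records become sparser as C grows, which is what the sparse formulation exploits.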

5.3 Serial Performance 
Table 2 summarizes performance of the algorithms
described in Section 4.2, executed serially on a single SP2
processor. Performance is given in terms of millions of
weight connection updates per second (MCUP/s); for all
algorithms (including the sparse BSOM implementation),
this is computed as

$$\text{effective MCUP/s} = \frac{N_{fields} \, N_{nodes} \, N_{records} \, N_e}{10^6 \times \text{training time}}$$

where $N_{fields}$ is the number of fields after binary expansion,
$N_{records}$ is the number of records, $N_e$ is the number of epochs,
the training time is in seconds, and $N_{nodes} = K$ is the number
of neural nodes. We do not
impose any cutoff on the Gaussian neighborhood function
[Eq. (4)]; for the on-line SOM algorithm, this means that we
update all weights at every record, while for batch SOM, we
accumulate contributions in Eq. (5) for all weights. The
effective MCUP/s for the sparse batch SOM are determined
using the same expression for the number of weight updates
to facilitate comparison of absolute computing time with the
other algorithms.

The results in Table 2 show that the batch SOM is slightly
faster than the on-line SOM since it does fewer floating point
operations per update. The sparse batch SOM is particularly
effective for Retail1, since this data set is the most sparse.
For 16 neural nodes, the sparse BSOM is nearly 14 times
faster than the conventional BSOM, but for 64 nodes, the
performance improvement drops to a factor of 8.6. The
difference in the performance of the sparse BSOM for 16
and 64 nodes in Retail1 appears to be due to the fact that the
smaller problem (16 nodes) can be held in the high-speed
cache of the SP2 processor during a single iteration, thus
minimizing the number of cache misses. It is interesting to
note that the 49.35 effective MCUP/s for the 64-node Retail1
problem is equivalent to [0.0287 × 49.35 =] 1.41 actual MCUP/s,
which is approximately 4 times slower than the non-sparse
BSOM for the same problem. The increased time per actual
weight update in the sparse BSOM is due to the loss of
pipelining efficiency in the SP2 (RS/6000) processor when
executing loops with the indirect addressing required in the
sparse case. For Retail2, the cost is less, but we see that the
sparse BSOM is only slightly more efficient than the conventional
BSOM even though the data has a nonzero fraction
of 0.4231.
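
The ratios quoted above follow directly from the Retail1 entries of Table 2, as the short check below shows. The helper `effective_mcups` is a hypothetical illustration of how an effective rate could be computed, assuming that every weight connection is counted for every record and every epoch. It is not taken from the implementation.

```python
def effective_mcups(n_fields, n_nodes, n_records, n_epochs, seconds):
    """Assumed form: all weight connections counted for every record and epoch."""
    return n_fields * n_nodes * n_records * n_epochs / (1e6 * seconds)


# Retail1 values from Table 2 (effective MCUP/s), keyed by number of neural nodes.
bsom = {16: 5.92, 64: 5.72}
sparse = {16: 83.14, 64: 49.35}
nonzero_fraction = 0.0287  # Retail1, from Table 1

speedup_16 = sparse[16] / bsom[16]          # sparse vs. conventional, 16 nodes
speedup_64 = sparse[64] / bsom[64]          # sparse vs. conventional, 64 nodes
actual_64 = nonzero_fraction * sparse[64]   # actual updates per second (millions)
slowdown_64 = bsom[64] / actual_64          # cost per actual update, relative to BSOM

if __name__ == "__main__":
    print(round(speedup_16, 2), round(speedup_64, 2))
    print(round(actual_64, 2), round(slowdown_64, 2))
```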
5.4 Parallel Performance 

We report parallel performance for the 3 algorithms in 
terms of speedup as a function of the number of application 
tasks relative to their respective serial performance given in 
Table 2. Hence, the absolute performance (MCUP/s) of the 
parallel versions is given by the product of the parallel 
speedup and the absolute serial performance from Table 2. 

FIGS. 8 and 9 show speedups for the 16- and 64-node
Retail1 problems using the 3 different parallel implementations.
The network-partitioned SOM method shows essentially
no speedup for the smaller 16-node problem, and
achieves only a speedup of slightly less than 4 at 8 tasks
before tailing off at 16 tasks. As discussed in Section 3.1, this
behavior is easily explained by the interprocessor communication
overhead incurred at the processing of each record;
the performance is somewhat better for 64 nodes because the
fixed overhead is a smaller fraction of compute time in each
task.

The data-partitioned BSOM methods show excellent scalability
for Retail1. It is interesting to note that the non-sparse
BSOM method achieves better speedup than the sparse
BSOM. This is because the data partitioning was done by
allocating equal numbers of records to each application task,
and for the sparse BSOM method, this can lead to load
imbalance because the data processed by different tasks may
have different sparsity ratios. This load imbalance does not
occur in the non-sparse BSOM because the computation is
done for all input fields regardless of whether or not they are
zero. For this problem with 16 application tasks, the load
imbalance limits the maximum parallel speedup to 14.0; we
measure a speedup of 13.2 for the 64-node sparse BSOM
run. Note that essentially linear speedup is observed in the
data-partitioned BSOM run for 64 nodes. As described in
Section 4.2, this method is reading the input data from the
parallel file system at every epoch; the excellent scalability
at the application level confirms that reading the input data
is not limiting scalability of the training run.
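
The effect of load imbalance on the attainable speedup can be sketched as follows. The sketch assumes that the work of a sparse task is proportional to the number of non-zero entries it processes and that the slowest task determines the run time; communication is ignored. The function name and the per-task work figures are hypothetical.

```python
def max_speedup(work_per_task):
    """Upper bound on speedup when the slowest task determines the run time."""
    return sum(work_per_task) / max(work_per_task)


if __name__ == "__main__":
    balanced = [100] * 16
    # One task holds records that are denser than those of the other tasks.
    imbalanced = [100] * 15 + [120]
    print(max_speedup(balanced), max_speedup(imbalanced))
```

With equal record counts per task, the non-sparse method always falls in the balanced case, whereas the sparse method inherits whatever variation in sparsity exists between the blocks of records.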

FIGS. 10 and 11 show similar analysis for the Retail2
problem. Network-partitioned results (not shown) for this
problem show no speedup because the number of data fields
is not large enough to amortize the interprocessor communication
overhead. With the amount of data in this problem,
there is some loss of parallel efficiency due to the
MPI_Allreduce operation shown in FIG. 4. The speedup curves
for the 64-node problem look very similar for both methods
because the load imbalance in the sparse BSOM method is
smaller than in Retail1.

The speedup curves for the Census problem are shown in
FIGS. 12 and 13. The load imbalance is negligible here for
the sparse BSOM method; the reduced scalability for the
sparse BSOM method for 16 nodes is due to the fact that the
sparse BSOM compute rate is approximately 4 times faster
than the non-sparse BSOM for this problem, and hence the
interprocessor communication (MPI_Allreduce in FIG. 4)
has a larger impact on the parallel speedup. Both methods
show linear speedup for the larger 64-node problem.

07/08/2003, EAST Version: 1.03.0002

US 6,260,036 B1

6 Interpretation of SOM Results 

Speeding up the SOM technique is only worthwhile if the
resulting method produces useful, readily interpreted results.
In this section, we present the interpretation of results
obtained for the Census data set described in the preceding
section. Additional discussion of methods used to visualize
and interpret the Retail2 data set can be found in [H.
Rushmeier, R. Lawrence and G. Almasi, Case study: visualizing
customer segmentations produced by self-organizing
maps, submitted for publication].

Clustering or segmentation uses unsupervised training to
identify groups of records which are mathematically similar
in the input data space. One use of this information in a
business context is the development of different (i.e.,
targeted) marketing strategies for each cluster or segment
depending on the characteristics of the segment. A self-organizing
feature map with K neural nodes immediately
defines a useful segmentation: the records "closest" to node
k [in the sense of Eq. (2)] form a single segment with a
centroid (in the original n-dimensional input space) defined
by the converged weight vector $\omega_k$. Furthermore, the self-organizing
property of the SOM provides additional insight
into the relationships between these clusters: records associated
with neighboring nodes on the map will exhibit a
greater similarity than records associated with non-neighboring
nodes. This information can be used to combine
records in adjacent nodes on the map to form larger "super-clusters"
for additional marketing analysis.
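
The segmentation defined by a trained map can be sketched as below. Eq. (2) is not reproduced in this section, so the sketch assumes that "closest" means smallest Euclidean distance. The function names and the toy data are hypothetical.

```python
def closest_node(x, weights):
    """Index of the weight vector nearest to record x (squared Euclidean distance)."""
    def sq_dist(w):
        return sum((a - b) ** 2 for a, b in zip(x, w))
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k]))


def segment_records(records, weights):
    """Group record indices by the node they are closest to."""
    segments = {k: [] for k in range(len(weights))}
    for i, x in enumerate(records):
        segments[closest_node(x, weights)].append(i)
    return segments


if __name__ == "__main__":
    weights = [(0.0, 0.0), (1.0, 1.0)]
    records = [(0.1, 0.0), (0.9, 1.2), (0.2, 0.3)]
    print(segment_records(records, weights))
```

The length of each list gives the segment population, and the records in a segment can then be summarized field by field for interpretation.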

Our SOM analysis of the "adult" (Census) database
[University of California at Irvine, Machine learning
databases, ftp://ftp.ics.uci.edu/pub/machine-learning-databases/]
used a concatenation of the "training" and "test"
data sets, omitting any records containing fields with
unknown values. The resulting data set contained 45,216
records, which we analyzed using the data-partitioned batch
SOM method with 64 neural nodes arranged on a square
two-dimensional map. The classification field ("class") was
not used as an input variable to the SOM analysis, but is used
below in the characterization of the resulting segments. The
interpretation of the general trends in the clustering is aided
by visualizing the resulting distribution of the various
attributes across the SOM grid.

FIG. 14 shows the relative population of the various
segments, along with our system of numbering the segments.
The segment population ranges from a low of 97 records for
segment 58, to a high of 1868 for segment 63. On the right,
FIG. 14 shows the distribution of the class field, which is a
binary representation of whether or not income is greater
than $50,000. Although the class field was not used to train
the SOM, we see that the records have been organized so
that there is a clear pattern in the distribution of the class
field. The marker for each segment is sized according to the
ratio of the fraction of records with greater than $50,000 a
year income in this segment to the fraction of records with
greater than $50,000 in the whole population. For example,
in segment 63, 1395 of 1868 records have income greater
than $50,000, or about 75%. In the whole population, only
about 25% of the records have income greater than $50,000.
By contrast, only 14 of 1419 records, or about 1%, in
segment 1 have income greater than $50,000. Income
is relatively smoothly distributed on the SOM
grid, with higher incomes near the upper right of the grid
and in the center of the bottom of the grid.
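
The marker-size ratio described above can be computed as in the following sketch. The function name is hypothetical, and the overall population fraction is taken as roughly 0.25 for illustration only.

```python
def relative_fraction(hits_in_segment, segment_size, population_fraction):
    """Fraction of a segment with the attribute, relative to the whole population."""
    return (hits_in_segment / segment_size) / population_fraction


if __name__ == "__main__":
    overall = 0.25  # assumed fraction of all records with income above $50,000
    print(relative_fraction(1395, 1868, overall))  # segment 63
    print(relative_fraction(14, 1419, overall))    # segment 1
```

A ratio well above 1 marks a segment in which the attribute is over-represented, and a ratio near 0 marks one in which it is nearly absent.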

The segmentation can be further interpreted by considering
the distribution of other attributes. FIG. 15 shows the
distribution of male and female on the grid, with the marker
sized according to the fraction of the segment that is male
relative to the fraction that is male in the overall population.
Generally the bottom three rows of the grid are female, and
the top of the grid is male. The concentration of higher
incomes in the upper right of the grid then corresponds to
males with high income, while the concentration in the center
of the bottom of the grid corresponds to females with high income.

On the right, FIG. 15 shows the age distribution. Each
marker is sized by the ratio of the average age in the segment
to the average age of the population. The grid shows
generally higher age to the right and lower to the left. Both
females with high income (around segment 5) and males
with high income (around segment 63) are about the
average population age (about 40 years old). Neighboring
segments 30 and 22, 31 and 23, and 32 and 24 have similar
age distributions, while the triplets 30, 31, 32 and 22, 23,
24 each have the same gender. Comparing the neighboring
same-age segments with different genders shows that the male
segments have consistently higher income (e.g., 30 has
higher average income than 22, etc.).

FIG. 16 shows the multivalued attribute for level of
education as a pie chart. Each wedge in the pie represents a
level of education, with high school and below in the lower
semi-circle, and beyond high school in the upper semi-circle.
Each wedge is shaded by the fractional representation
that attribute value has in the segment, relative to the
fractional representation of that attribute value in the whole
population. For example, in segment 57 all of the records list
HS-grad (high school graduate) for education type, so that
wedge is shown in black, and all of the other wedges are
shown in white. By contrast, segment 8 has a distribution of
all the various education levels. While this display is more
complex, trends are still evident. Segments 36, 37, 44, and
45 all have concentrations of education levels of high school
and below, and these segments also correspond to low
income levels. The segments with high income levels tend to
have higher concentrations in the higher education levels.
The distributions of education levels for females with high
income (segment 5) and males with high income (segment
63) are nearly identical.

Finally, FIG. 17 shows the distribution of marital status 
across the grid. The pie representation for this multivalued 
attribute is the same as for education level. Generally, 
married with civilian spouse is represented on the right of 
the grid and never married is more highly represented on the 
left of the grid. The never married attribute is most highly 
concentrated in the areas with lower age and lower income. 

In general then, the visualization of attributes across the
SOM gives insight into the characteristics of each segment,
and the distribution of attributes across the whole population.
Clearly, combining demographic data such as that used
in this segmentation with spending data from a commercial
data warehouse would give valuable insight into a company's
customer database. Segments, or groups of neighboring
segments, could be selected and marketing campaigns tailored
for these segments according to their demographics
and spending interests.
7 Summary and Conclusions 

In this paper, we have developed a data-partitioned parallel
method for the well-known self-organizing feature map
developed by Kohonen. Our approach is based on an
enhanced version of the batch SOM algorithm which is
particularly efficient for sparse data sets encountered in retail
data mining studies. We have demonstrated the computational
efficiency and parallel scalability of this method for
sparse and non-sparse data, using 3 data sets, two of which
include actual retail spending data. Model problem analysis,
plus visualizations of the segmentations produced for publicly
available census data, have shown that the batch SOM
methodology provides reasonable clustering results and useful
insights for data mining studies. Algorithms similar to
those discussed in this paper are planned for inclusion in a
future release of the IBM Intelligent Miner [IBM Intelligent
Miner, http://www.software.ibm.com/data/intelli-mine] data
mining product.



The attached notes describe the design for a software tool which we use
to analyze the results of a clustering operation performed using the
Self-Organizing Feature Map (SOM) algorithm. The clustering operation
identifies groups of records which are similar based on their input
variables or fields. The object of the post-processing is to understand the
characteristics of the clusters in order for the overall results to be
useful to data mining end users. We need to understand how the records
that are assigned to specific clusters differ from the average behavior of
all the records in the original data set. We do this for each cluster by
identifying which input variables for the records in the cluster differ the
most from the database average. This determination is made by sorting the
fields according to various ratios as described below in the design notes.

The software tool draws a view (FIG. 1) of the SOM displaying the 3
fields which serve to differentiate each cluster from the average over
the database. Another view (FIG. 2) displays additional information
(described in design notes) for the 10 most significant fields. Another
view (FIG. 3) shows the SOM with information on a chosen input field
which is used to determine how the field contributed in each of the
clusters on the SOM.

Original design notes: ('segment', as used here, is same as 'cluster')

SUMMARY OF KMAP DESIGN (3/12/98):

DEFINITIONS:

(1) Input data from Intelligent Miner results object:

    records[segment] = number of records in segment
    values[segment][field] = sum of values for field over all records
        in segment
    NZvalues[segment][field] = number of records in segment with non-zero
        values in field

    NZ implies non-zero

(2) Computed quantities from input data:

    meanValue[segment][field] = values[segment][field] / records[segment]
    meanNZval[segment][field] = values[segment][field] / NZvalues[segment][field]
    meanPartc[segment][field] = NZvalues[segment][field] / records[segment]

    NZ implies non-zero value
    Partc implies participation (i.e., >0 value)

(3) Ratios to background (R implies ratio):

    Rrecords[segment] = records[segment] / records[BKGD]
    Rvalue[segment][field] = values[segment][field] / values[BKGD][field]
    RmeanX[segment][field] = meanX[segment][field] / meanX[BKGD][field]
    (X can be 'Value', 'NZval', 'Partc')

(4) Note:

    segment can be BKGD, in which case data is returned for the entire dataset
    field can be TOTALS, in which case data is returned as sum over fields
QUERIES

(1) seq <row> <col>

    returns list of fieldnames, sorted by field according to
    one of the following sort modes:

    weight[segment][field]      -> SORT_MODE = 'weight'
    Rvalue[segment][field]      -> SORT_MODE = 'V'
    RmeanValue[segment][field]  -> SORT_MODE = 'meanV'
    RmeanNZval[segment][field]  -> SORT_MODE = 'meanN'
    RmeanPartc[segment][field]  -> SORT_MODE = 'meanP'

    row < 0 or col < 0 implies segment = BKGD

(2) field <fieldname>

    returns Kohonen map showing one of the following quantities:

    weight[segment][field]      -> MAP_MODE = 'weight'
    Rvalue[segment][field]      -> MAP_MODE = 'V'
    RmeanValue[segment][field]  -> MAP_MODE = 'meanV'
    RmeanNZval[segment][field]  -> MAP_MODE = 'meanN'
    RmeanPartc[segment][field]  -> MAP_MODE = 'meanP'
    Rrecords[segment]           -> MAP_MODE = 'records'

(3) sort <SORT_MODE>

    sets the SORT_MODE as defined above

(4) map <MAP_MODE>

    sets the MAP_MODE as defined above

(5) summary

    prints top 3 fields (according to SORT_MODE) in each segment

(6) show <fieldname> <row> <col>

    returns data for this field in this segment

(7) seqdist <r1> <c1> <r2> <c2>

    returns distance between two segments



Format of .res file:

    1??
    nb_buckets min max       //total population
    min max sum1 sum2        //total population
    record splits            //total population
    min max sum1 sum2        //segment 0
    record splits            //segment 0
    min max sum1 sum2        //segment 1
    record splits            //segment 1

record split is defined as (assuming nb_buckets 0-12)

    totalFreq bin0 bin1 . . . bin10 bin11 bin12 +    //freqs
    totalFreq bin0 bin1 . . . bin10 bin11 bin12 +    //sum1
    totalFreq bin0 bin1 . . . bin10 bin11 bin12 +    //sum2

TABLE 1

Description of the application data sets

                                          Retail1   Retail2   Census
Number of records                           99984    126282    45216
Number of input data fields                   272        14       13
Number of input data fields                   272        14      103
  (after binary expansion)
Fraction of non-zero input data fields     0.0287    0.4231   0.1239
Size of input data file (MB)                207.5      13.5     35.5


TABLE 2

Effective MCUP/s for the serial algorithms

                      Neural nodes   Retail1   Retail2   Census
On-line SOM                     16      5.70      3.41     5.53
                                64      5.62      4.56     6.96
Batch SOM (BSOM)                16      5.92      4.15     6.82
                                64      5.72      5.31     6.17
Sparse batch SOM                16     83.14      6.67    34.46
                                64     49.35      7.08    22.10



Having thus described our invention, what we claim as 
new and desire to secure by Letters Patent is: 

1. A method of organizing data in a parallel database
which is partitioned across computational processors of a
parallel computer, said data being organized into a plurality
of records, said method comprising:

a. representing each record as an n dimensional vector;

b. compressing each n dimensional vector by eliminating
zeros in components of each said n dimensional vector
for each of said records; and

c. applying a modified self-organized map algorithm to
operate on the non-zero components of each vector for
said compressed input records to group said records
into a plurality of clusters, wherein each cluster comprises
a plurality of said records having a set of
common input parameters.

2. A method as recited in claim 1, wherein said zeros are
deleted each time each record is accessed.

3. The method of claim 1 wherein said applying of a
modified self-organizing map algorithm comprises invoking
the equation

$$d_k(t) = \sum_{x_i \neq 0} x_i(t) \left[ x_i(t) - 2\,\omega_{k,i}(t_0) \right] + \sum_{i=1}^{n} \omega_{k,i}^2(t_0).$$


4. A method of retrieving data from a parallel database
which is partitioned across computational nodes of a parallel
computer, said data being organized into a plurality of
records, said method comprising:

a. representing each record as an n dimensional vector;

b. compressing each n dimensional vector by eliminating
zeros in components of each said n-dimensional vector
for each of said records;

c. applying a modified self-organizing map algorithm to
operate on the non-zero components of each vector for
said compressed input records to group said records
into a plurality of clusters, wherein each cluster comprises
a plurality of said records having a set of
common input parameters; and

d. determining statistical measures of each record that is
retrieved by using said input parameters associated
with one of said clusters, said one cluster being the
cluster from which said record was retrieved.

5. A method as recited in claim 4, wherein said statistical
measures are determined by comparison of input parameters
of corresponding clusters.

6. The method of claim 4 wherein said applying of a
modified self-organizing map algorithm comprises invoking
the equation

$$d_k(t) = \sum_{x_i \neq 0} x_i(t) \left[ x_i(t) - 2\,\omega_{k,i}(t_0) \right] + \sum_{i=1}^{n} \omega_{k,i}^2(t_0).$$

7. A program storage device readable by a machine,
tangibly embodying a program of instructions executable by
said machine to perform method steps in a parallel transaction
database which is partitioned across computational
processors of a parallel computer, said data being organized
into a plurality of records, said method comprising the steps
of:

a. representing each record as an n dimensional vector;

b. compressing each n dimensional vector by eliminating
zeros in components of each said n dimensional vector
for each of said records; and

c. applying a modified self-organized map algorithm to
operate on the non-zero components of each vector for
said compressed input records to group said records
into a plurality of said nodes, wherein each group
comprises a plurality of said records having a corresponding
set of common input parameters.

8. The device of claim 7 wherein said step of applying a
modified self-organizing map algorithm comprises invoking
the equation

$$d_k(t) = \sum_{x_i \neq 0} x_i(t) \left[ x_i(t) - 2\,\omega_{k,i}(t_0) \right] + \sum_{i=1}^{n} \omega_{k,i}^2(t_0).$$

* * * * *